1 Introduction

Although maximum mutual information (MMI) training has been used for hidden Markov model (HMM) parameter estimation for more than twenty years ([2], [8], [5], [9], and [11]), it has recently become an essential part of the acoustic modeling repertoire thanks to the refinements introduced by Woodland and Povey ([16] and [11]). The earliest incarnations of MMI worked well on small vocabulary tasks with small models, for example digit recognition. With the current methodology, lattice-based MMI, one can expect to gain 10-20% in recognition accuracy over standard maximum likelihood methods regardless of the size of the task or the models.

The machinery of lattice-based MMI consists of a model selection criterion called the MMI criterion and an iterative estimation algorithm called the extended Baum-Welch algorithm. This machinery is analogous to - it is in fact based on - the standard machinery used for maximum likelihood estimation with HMMs, where the model selection criterion is the log-likelihood of the training data and the iterative estimation algorithm is the Baum-Welch algorithm ([3]). In both cases the estimation algorithm operates on the space of all possible model parameters by producing a new estimate of model parameters from an original estimate. Also, both of these estimation algorithms have been designed so that the model selection criterion is larger on the new estimate than it was on the original estimate. Finally, in both cases the machinery is operated in the same manner: starting from a choice of initial model parameters, we repeatedly apply the estimation algorithm, first to the initial choice, next to the result of this, etc., thereby creating a sequence of model parameters. In the case of maximum likelihood estimation with the Baum-Welch algorithm the properties of the resulting sequence of model parameters are understood and generally good, while in the case of lattice-based MMI with extended Baum-Welch these properties have never been studied.^1

Figure 1 is representative of plots that appear in nearly all of the literature on MMI. We note that the MMI criterion steadily increases over the twenty iterations, while the word error rate (WER) initially decreases, levels out, and then begins a slight upward trend with notable oscillation. The conventional wisdom has been that this is due to 'overtraining', i.e., that MMI, whether lattice-based or not, somehow over-specializes the models to the training data at the expense of recognition performance on more general test data. Even if one believes this explanation, it is worth understanding which mechanism is to blame for this over-specialization. Is it a property of the algorithm, extended Baum-Welch, or a property of the model selection criterion, the MMI criterion, or something else entirely that is at the root of the problem? This is the central question that we will address in this paper.

One of the starting points of this research was to investigate what happens if we run many more iterations of extended Baum-Welch than is typical. The motivation was to investigate whether or not the model parameters are actually converging. Figure 2 extends the results in Figure 1 by running eighty more iterations of extended Baum-Welch. The MMI criterion is by design supposed to be more predictive of recognition performance than the maximum likelihood criterion. Yet we see that the MMI criterion steadily increases while the corresponding recognition performance falls apart. Since the MMI criterion has not converged, we also conclude that the model parameters have not converged even after 100 iterations.

This is in stark contrast to what happens with maximum likelihood estimation with the Baum-Welch algorithm. Figure 3 shows the analogous experiment. Note that the behavior is much more benign. The log-likelihood steadily increases, as theory predicts, but it appears to be converging to around -48.9. Also, while the WER oscillates, the amplitude is very small, and it too appears to be converging, to 17.7%. Note that this, as a practical matter, is much more desirable behavior than what we observed in Figure 2.

^1 However, the tacit assumption in the literature is that the sequence of model parameters produced by extended Baum-Welch does converge. For example, the notion of weak-sense auxiliary functions in [11] appears to depend upon this convergence.

Figure 1: The WER on an independent test set and the MMI criterion on the training data during twenty iterations of extended Baum-Welch. The x-axis gives the extended Baum-Welch iteration, with x=0 being the mle.

Figure 2: The WER on an independent test set and the MMI criterion on the training data during 100 iterations of extended Baum-Welch. The x-axis gives the extended Baum-Welch iteration, with x=0 being the mle.

Figure 3: The WER on an independent test set and the log-likelihood on the training data during 100 iterations of Baum-Welch. The x-axis gives the Baum-Welch iteration, with x=0 being the initial models.



There is little to gain by worrying about how long we run maximum likelihood estimation, while one has to be very careful to run lattice-based MMI just the right number of times. To make matters worse, since the MMI criterion is puzzlingly disconnected from test set recognition performance, we are forced to use recognition performance on an independent validation set to determine when to stop extended Baum-Welch. Aside from being puzzling, this disconnect between the MMI criterion and general recognition performance leaves one vulnerable to questions about how to construct an adequate validation test set for model selection. These considerations also make lattice-based MMI more difficult to fit into a fully automatic assembly-line acoustic model factory than maximum likelihood estimation.

So, again, why does lattice-based MMI behave so differently from maximum likelihood estimation? As we will remind the reader in Section 2.1, for practical reasons we actually use an approximate version of the MMI criterion. In lattice-based MMI there are two aspects to this approximation and they are both encapsulated in the use of phone-marked word lattices. The first approximation occurs in the calculation of the MMI criterion via Bayes' rule. Instead of summing over all possible transcriptions of a given training utterance - which is impossible - we restrict this sum to the transcriptions that occur in the corresponding lattice. The second approximation occurs in the calculation of the HMM-based acoustic scores which are inputs for the calculation of the MMI criterion. Instead of summing over all possible state sequences compatible with a given transcription, as the HMM demands, we restrict this sum to a subset of state sequences that are compatible with the phone-level time boundaries - the phone-marks - that occur in the corresponding lattice. The resulting approximation to what we might call the true MMI criterion is what extended Baum-Welch actually uses as a model selection criterion in lattice-based MMI.

In this paper we will demonstrate that the properties of lattice-based MMI depend on the properties of this approximation. The usual practice is to first generate the phone-marked word lattices using the mle seed models, then use these fixed lattices throughout multiple iterations of extended Baum-Welch. This results in the behavior displayed in Figure 2. In Section 3.1 we will demonstrate that the difference between the approximate and true MMI criteria is very small at the mle but steadily increases as the model parameters move away from the mle. We will also demonstrate that the approximate MMI criterion appears to attain its maximum value not within the model parameter space but, instead, at a point at infinity. This suggests that model estimation using the approximate MMI criterion is an ill-posed problem. It also means that the approximate MMI criterion is a terrible choice to perform dual roles, first as an approximation to the MMI criterion and second as a model selection criterion, since by design it will produce estimates for the model parameters that are far from the mle, where it is no longer related to the MMI criterion. Extended Baum-Welch obliges these properties by producing a sequence of parameters that heads to a point at infinity with steadily increasing approximate MMI criterion. In this respect the algorithm, extended Baum-Welch, is blameless: it is merely obeying the pathological demands of the approximate MMI criterion.

In Section 3.2 we explore what happens if we use a much better approximate MMI criterion. We accomplish this by regenerating the lattices between each iteration of extended Baum-Welch. The resulting approximate MMI criterion is very close to the true MMI criterion at each iteration of extended Baum-Welch. We observe that the resulting behavior of lattice-based MMI is much more benign, similar to what we observe with maximum likelihood estimation. In particular, nothing that could be labeled 'overfitting' occurs.

The remainder of this paper is organized as follows. In Section 2 we introduce the notation that we will be using, details concerning the approximate MMI criterion, and the particulars of our experimental set-up. Section 3 describes and analyzes the behavior of extended Baum-Welch using three different approximate MMI criteria. We then wrap up the main body of the paper with a discussion in Section 4. Finally, we have also included three appendices that further analyze the experiment in Section 3.1; they cover an alternate analysis of the behavior of the model parameters in Appendix A, the effect on our results of a parameter that controls the behavior of extended Baum-Welch in Appendix B, and preliminary results concerning MPE in Appendix C.

2 Preliminaries

2.1 The approximate MMI criterion 

Let $X_1, X_2, \ldots, X_n$ be a sequence of random, $d$-dimensional acoustic vectors, which we will abbreviate by $X$, and let $W$ be a random transcription taking values in $\mathcal{W}$. We let $(x, w)$ be the acoustics and the transcription of the training data that we will describe in Section 2.2. We will denote the HMM-based probability model for $X$, the acoustic model, by $f_\theta$, where $\theta$ are the model parameters that take values in $\Theta$. We will denote the probability distribution, or language model, for $W$, simply by $p$.^2

The earlier versions of MMI used the conditional likelihood of the training data, $p_\theta(w \mid x)$, as a model selection criterion, which is given via Bayes' rule by

$$ p_\theta(w \mid x) = \frac{f_\theta(x \mid w)\, p(w)}{\sum_{w' \in \mathcal{W}} f_\theta(x \mid w')\, p(w')}. \tag{2.1} $$

In principle the criterion that lattice-based MMI uses for model selection is a scaled version of $p_\theta(w \mid x)$, which we denote by $p_\theta(w \mid x; \kappa)$, that is defined by

$$ p_\theta(w \mid x; \kappa) = \frac{f_\theta(x \mid w)^{1/\kappa}\, p(w)}{\sum_{w' \in \mathcal{W}} f_\theta(x \mid w')^{1/\kappa}\, p(w')}. \tag{2.2} $$

The scale $\kappa$, which is known as the language model scale, is used in all practical recognition systems to balance the relative weights of the probabilities obtained from the language model and the acoustic model. Since something analogous to the conditional likelihood $p_\theta(w \mid x; \kappa)$ is used for hypothesis selection during recognition, it is also natural to use it as a model selection criterion.^3 In reality, however, lattice-based MMI actually uses an approximation to $p_\theta(w \mid x; \kappa)$ for model selection. This approximation, which has two aspects, is the result of efficiencies that extended Baum-Welch is able to make based on two properties of the lattices.

The first aspect of the approximation, which is common to all of the versions of MMI, involves the word level properties of the lattices and the sum in the denominator term of (2.2). Instead of summing over all the possible transcriptions in $\mathcal{W}$, which is impossible except in the simplest tasks, we restrict ourselves to a finite subset $\mathcal{V}_{\theta_0} \subset \mathcal{W}$. The subset $\mathcal{V}_{\theta_0}$ is obtained by keeping all of the hypotheses with probability bigger than some small $\epsilon > 0$



^2 In practice there are really two language models in play in this paper. The first is used exclusively on the training data for lattice generation, discriminative training, and recognition. Following current standard practice (see [13] or [11]), $p$ is a relatively weak bigram language model estimated from the training transcriptions $w$. The second is used exclusively on our independent test set. This distribution is a larger bigram language model estimated from transcriptions disjoint from both the test and training transcriptions.

^3 This idea was first described in [12].






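during a recognition over the training data using the seed acoustic models $\theta_0$. The correct transcript, $w$, is also added to $\mathcal{V}_{\theta_0}$. We may think of $\mathcal{V}_{\theta_0}$ as essentially^4 being defined by

$$ \mathcal{V}_{\theta_0} = \{ w' \in \mathcal{W} : p_{\theta_0}(w' \mid x; \kappa) > \epsilon \} \cup \{ w \}. \tag{2.3} $$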

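We use $\mathcal{V}_{\theta_0}$ to construct an approximation to $p_\theta(\cdot \mid x; \kappa)$ that we denote by $p_\theta(\cdot \mid x; \kappa, \mathcal{V}_{\theta_0})$ and that we define for any $v \in \mathcal{V}_{\theta_0}$ by

$$ p_\theta(v \mid x; \kappa, \mathcal{V}_{\theta_0}) = \frac{f_\theta(x \mid v)^{1/\kappa}\, p(v)}{\sum_{v' \in \mathcal{V}_{\theta_0}} f_\theta(x \mid v')^{1/\kappa}\, p(v')}. \tag{2.4} $$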

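The approximation $p_\theta(\cdot \mid x; \kappa, \mathcal{V}_{\theta_0})$ is a probability mass function, although prima facie not for $W$ but for $V$, where $V$ is the restriction of $W$ from $\mathcal{W}$ to $\mathcal{V}_{\theta_0}$. Of course, we can extend $p_\theta(\cdot \mid x; \kappa, \mathcal{V}_{\theta_0})$ to be a probability mass function for $W$ by setting it to zero on $\mathcal{W} \setminus \mathcal{V}_{\theta_0}$. How good the approximation

$$ p_\theta(\cdot \mid x; \kappa) \approx p_\theta(\cdot \mid x; \kappa, \mathcal{V}_{\theta_0}) \tag{2.5} $$

is for arbitrary $\theta$ depends on how large the missing probability density given by

$$ \sum_{w' \in \mathcal{W} \setminus \mathcal{V}_{\theta_0}} f_\theta(x \mid w')^{1/\kappa}\, p(w') \tag{2.6} $$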



is. At the seed models $\theta_0$, provided that we have chosen the $\epsilon$ in (2.3) small enough, the corresponding sum in (2.6) should be very small, so the approximation (2.5) should be very close to an identity. Hence for all $w \in \mathcal{W}$

$$ p_{\theta_0}(w \mid x; \kappa) = p_{\theta_0}(w \mid x; \kappa, \mathcal{V}_{\theta_0}). $$
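
As a hedged illustration of (2.4) and (2.6), the following minimal sketch computes the restricted posterior and the missing sum for a toy hypothesis set. The function names, the hypothesis labels and all of the scores are assumptions made up for the illustration; they do not come from our experiments or from any toolkit.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def restricted_posterior(log_f, log_p, kappa, subset):
    """Scaled posterior (2.4) restricted to `subset`, extended by zero elsewhere.

    log_f[w] is log f_theta(x | w); log_p[w] is log p(w).
    """
    score = {w: log_f[w] / kappa + log_p[w] for w in log_f}
    log_den = logsumexp([score[w] for w in subset])
    return {w: math.exp(score[w] - log_den) if w in subset else 0.0
            for w in log_f}

def missing_sum(log_f, log_p, kappa, subset):
    """The sum (2.6) over the hypotheses left out of `subset`."""
    return sum(math.exp(log_f[w] / kappa + log_p[w])
               for w in log_f if w not in subset)

# Hypothetical toy scores: three hypotheses, one of them pruned from the lattice.
log_f = {"ref": -1600.0, "alt": -1616.0, "rare": -1760.0}
log_p = {"ref": math.log(0.5), "alt": math.log(0.3), "rare": math.log(0.2)}
kappa = 16.0

full = restricted_posterior(log_f, log_p, kappa, set(log_f))
lattice = restricted_posterior(log_f, log_p, kappa, {"ref", "alt"})
left_out = missing_sum(log_f, log_p, kappa, {"ref", "alt"})
```

In this toy case the pruned hypothesis carries almost no mass, so `full` and `lattice` nearly agree on the retained hypotheses, which is the situation the text describes at the seed models.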

The second aspect of the approximation involves the phone level properties of the lattices and the allowable state sequences in the definition of the hidden Markov model $f_\theta$. For any $w \in \mathcal{W}$, let $\mathcal{S}_w$ denote the set of hidden state sequences, $s$, that are compatible with transcription $w$ and have the



^4 This is not precisely correct since a practical recognizer is unable to consider all possible $w \in \mathcal{W}$ except on tasks like isolated digits. Thus in general there will be some $w \in \mathcal{W} \setminus \mathcal{V}_{\theta_0}$ with $p_{\theta_0}(w \mid x; \kappa) > \epsilon$. These $w$ are often called search errors.
same number of frames as $x$, namely $n$.^5 One facet of the model $f_\theta(x \mid w)$ is that we need to perform the following sum

$$ f_\theta(x \mid w) = \sum_{s \in \mathcal{S}_w} f_\theta(x, s). \tag{2.7} $$



Unfortunately, in the denominator of (2.4) we need to compute $f_\theta(x \mid v)$ for every $v \in \mathcal{V}_{\theta_0}$, which means that we need to efficiently evaluate sums analogous to (2.7) for all the state sequences in $\mathcal{S}_v$, where $v$ ranges over the large set $\mathcal{V}_{\theta_0}$. Lattice-based MMI gets around this problem by making use of so-called phone-marked word lattices. Effectively this means that each transcription $w \in \mathcal{V}_{\theta_0}$ has been force-aligned using the seed models $\theta_0$, but only the times at the phone boundaries - these are the phone-marks - are kept from the alignments. The approximation involves restricting the sum in (2.7) to a subset $R_w \subset \mathcal{S}_w$ that respects the phone-marks, in the sense that each phone's HMM is anchored at the start and end times given by the corresponding phone-marks.^6 If we let

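$$ \mathcal{R}_{\theta_0} = \{ R_v \}_{v \in \mathcal{V}_{\theta_0}}, $$

then given $v \in \mathcal{V}_{\theta_0}$ we define an approximation to $f_\theta(x \mid v)$, $g_\theta(x \mid v; \mathcal{R}_{\theta_0})$, by

$$ g_\theta(x \mid v; \mathcal{R}_{\theta_0}) = \sum_{s \in R_v} f_\theta(x, s). $$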
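
The relationship between the full sum (2.7) and its phone-marked restriction can be sketched by brute-force enumeration on a toy model. Everything below is an assumption made for illustration: two phones with two states each, one-dimensional unit-variance Gaussian outputs, a uniform transition probability of 0.5, and hypothetical phone-marks. A practical system would use a forward-backward recursion instead of enumeration.

```python
import itertools
import math

def gauss_logpdf(x, mean, var=1.0):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def state_sequences(n_frames, n_states):
    """All left-to-right, no-skip paths that visit every state at least once."""
    seqs = []
    for cuts in itertools.combinations(range(1, n_frames), n_states - 1):
        bounds = (0,) + cuts + (n_frames,)
        seq = []
        for state in range(n_states):
            seq += [state] * (bounds[state + 1] - bounds[state])
        seqs.append(tuple(seq))
    return seqs

def joint(x, seq, means, log_trans=math.log(0.5)):
    """Toy f_theta(x, s): output densities times uniform transition probabilities."""
    logp = sum(gauss_logpdf(xt, means[s]) for xt, s in zip(x, seq))
    logp += log_trans * (len(seq) - 1)
    return math.exp(logp)

def respects_marks(seq, phone_of_state, marks):
    """True if each frame's state belongs to the phone that the marks assign to it."""
    frame_phone = []
    for phone, (start, end) in enumerate(marks):
        frame_phone += [phone] * (end - start)
    return all(phone_of_state[s] == p for s, p in zip(seq, frame_phone))

x = [0.1, -0.2, 0.9, 1.1, 2.0, 2.1]      # toy one-dimensional acoustics
means = [0.0, 1.0, 2.0, 3.0]             # one mean per state
phone_of_state = [0, 0, 1, 1]            # states 0,1 -> phone 0; states 2,3 -> phone 1
marks = [(0, 3), (3, 6)]                 # hypothetical phone-marks (frame ranges)

S = state_sequences(len(x), len(means))
R = [s for s in S if respects_marks(s, phone_of_state, marks)]
f_full = sum(joint(x, s, means) for s in S)     # the sum (2.7)
g_marked = sum(joint(x, s, means) for s in R)   # the phone-marked restriction
```

Because the restricted sum runs over a subset of non-negative terms, `g_marked` can never exceed `f_full`; how much is lost depends on how well the marks match the current parameters.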

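The combination of these two approximations results in the following definition^7

$$ p_\theta(v \mid x; \kappa, \mathcal{V}_{\theta_0}, \mathcal{R}_{\theta_0}) = \frac{g_\theta(x \mid v; \mathcal{R}_{\theta_0})^{1/\kappa}\, p(v)}{\sum_{v' \in \mathcal{V}_{\theta_0}} g_\theta(x \mid v'; \mathcal{R}_{\theta_0})^{1/\kappa}\, p(v')}. $$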



^5 For example, in this paper we will be using HMMs to model triphones. So to construct the set $\mathcal{S}_w$ we first take all the phone-level pronunciations consistent with $w$, then expand them to produce corresponding triphone-level pronunciations, and finally enumerate all of the state sequences with length $n$ consistent with all of the possible triphone pronunciations.

^6 This is accomplished in extended Baum-Welch by accumulating statistics using a phone-arc version of the forward-backward algorithm. This is described in detail in

^7 The notation that we are using for $\mathcal{V}_{\theta_0}$ and $\mathcal{R}_{\theta_0}$ is ambiguous because $\mathcal{R}_{\theta_0}$ depends on $\mathcal{V}_{\theta_0}$ and both depend on $x$. Fortunately, from now on, all these quantities will be appearing together, e.g. in $p_\theta(v \mid x; \kappa, \mathcal{V}_{\theta_0}, \mathcal{R}_{\theta_0})$, thus eliminating any ambiguity.

The approximation $p_\theta(\cdot \mid x; \kappa, \mathcal{V}_{\theta_0}, \mathcal{R}_{\theta_0})$ to $p_\theta(\cdot \mid x; \kappa)$ is also a probability mass function for $V$ which we extend to $W$ by setting it to zero on $\mathcal{W} \setminus \mathcal{V}_{\theta_0}$. The approximate conditional likelihood of the training data, $p_\theta(w \mid x; \kappa, \mathcal{V}_{\theta_0}, \mathcal{R}_{\theta_0})$, is what lattice-based MMI actually uses as a model selection criterion.

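If $\mathcal{V}_{\theta_0}$ and $\mathcal{R}_{\theta_0}$ are large enough, then for any $w \in \mathcal{W}$ the approximation

$$ p_\theta(w \mid x; \kappa) \approx p_\theta(w \mid x; \kappa, \mathcal{V}_{\theta_0}, \mathcal{R}_{\theta_0}) \tag{2.8} $$

should, in fact, be an equality at $\theta_0$, namely

$$ p_{\theta_0}(w \mid x; \kappa) = p_{\theta_0}(w \mid x; \kappa, \mathcal{V}_{\theta_0}, \mathcal{R}_{\theta_0}). $$

This statement follows from construction. How good the approximation in (2.8) remains as $\theta$ moves away from $\theta_0$ is one of the central questions in this paper. Note that this approximation is only valid given the acoustic training data $x$.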

We shall be using the mle, which we denote by $\theta_{\mathrm{mle}}$, to generate the transcriptions and phone-marks in various combinations in our experiments. Instead of writing $\mathcal{V}_{\theta_{\mathrm{mle}}}$ or $\mathcal{R}_{\theta_{\mathrm{mle}}}$ we shall simply write $\mathcal{V}_{\mathrm{mle}}$ or $\mathcal{R}_{\mathrm{mle}}$.

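We shall refer to the quantity

$$ \mathrm{num}(x, w; \theta, \kappa, \mathcal{V}_{\theta_0}, \mathcal{R}_{\theta_0}) = \log g_\theta(x \mid w; \mathcal{R}_{\theta_0})^{1/\kappa}\, p(w) $$

as the numerator log-likelihood and the quantity

$$ \mathrm{den}(x; \theta, \kappa, \mathcal{V}_{\theta_0}, \mathcal{R}_{\theta_0}) = \log \Big( \sum_{v \in \mathcal{V}_{\theta_0}} g_\theta(x \mid v; \mathcal{R}_{\theta_0})^{1/\kappa}\, p(v) \Big) $$

as the denominator log-likelihood.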
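
The two quantities just defined, and their difference (the logarithm of the approximate MMI criterion), can be sketched as follows. The function name, the two-hypothesis lattice and the scores are hypothetical; the per-frame normalization mirrors the way results are reported in this paper.

```python
import math

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def num_den(log_g, log_p, kappa, reference, n_frames):
    """Per-frame numerator, denominator, and log-criterion for one utterance.

    log_g[v] is log g_theta(x | v; R) for each v in the lattice.
    """
    score = {v: log_g[v] / kappa + log_p[v] for v in log_g}
    num = score[reference]
    den = logsumexp(list(score.values()))
    return num / n_frames, den / n_frames, (num - den) / n_frames

# Hypothetical toy lattice with the reference and one competitor.
log_g = {"ref": -1600.0, "alt": -1616.0}
log_p = {"ref": math.log(0.6), "alt": math.log(0.4)}
n_frames = 100
num, den, log_mmi = num_den(log_g, log_p, kappa=16.0,
                            reference="ref", n_frames=n_frames)
```

Since the reference is one of the terms in the denominator sum, the difference `log_mmi` is never positive, and it approaches zero only when the competitors' terms become negligible.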



2.2 Experimental preliminaries 

In this section we give the details that all of our experiments share. We chose to work on a standard Wall Street Journal (WSJ) task from the early 1990s ([10], [7]) because, by modern standards, it is small enough that experimental turnaround is fast even with MMI, yet large enough that the results are believable. This task is also self-contained, with nearly all of the materials necessary for training and testing available through the LDC, the exception being a dictionary for training and testing pronunciations. We use pronunciations created at VoiceSignal Technologies (VST) using 39 non-silence phones. We use version 3.4 of the HTK toolkit to train and test our models.

We use the WSJ SI-284 set for acoustic model training. This training set consists of material from 84 WSJ0 training speakers and from 200 WSJ1 training speakers. It amounts to approximately 37000 training sentences and 66 hours of non-silence data. Each session was recorded using two microphones; we use the primary channel recorded using a Sennheiser microphone.

The VST front-end that we use produces a 39-dimensional feature vector every 10 ms: 13 Mel-cepstral coefficients, including c0, plus their first and second differences. The cepstral coefficients are mean normalized. The data is down-sampled from 16 kHz to 8 kHz before the cepstral coefficients are computed.
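
The assembly of the 39-dimensional vector from 13 cepstral coefficients can be sketched as below. This is only an illustration, not the VST front-end: the particular difference formula (a symmetric difference with the edge frames repeated) and all function names are assumptions, since the exact formula is not specified here.

```python
def mean_normalize(frames):
    """Subtract the per-coefficient mean taken over all frames."""
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    return [[f[i] - mean[i] for i in range(dim)] for f in frames]

def differences(frames):
    """Assumed formula: c[t+1] - c[t-1], repeating the first and last frame at the edges."""
    n = len(frames)
    out = []
    for t in range(n):
        prev, nxt = frames[max(t - 1, 0)], frames[min(t + 1, n - 1)]
        out.append([b - a for a, b in zip(prev, nxt)])
    return out

def assemble(cepstra):
    """Stack static coefficients with their first and second differences."""
    static = mean_normalize(cepstra)
    delta = differences(static)
    delta2 = differences(delta)
    return [s + d + dd for s, d, dd in zip(static, delta, delta2)]

# Toy input: 5 frames of 13 made-up cepstral coefficients.
cepstra = [[float(t + i) for i in range(13)] for t in range(5)]
features = assemble(cepstra)
```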

We use very small, simple acoustic models to lower the computational load for MMI. The acoustic models use word-internal triphones. Except for silence, each triphone is modeled using a three-state HMM without skipping. For silence we follow the standard HTK practice that uses two models for silence: a three-state tee-model and a single-state short pause model; the short pause model is tied to the middle state of the longer model; both models allow skipping.^8 The resulting triphone states were then clustered using decision trees to 1500 tied states. The output distribution for each tied state is a single, multivariate normal distribution with a diagonal covariance.

We report word error rate (WER) on two test sets. Using the nomenclature of the time, these test sets use the 5k closed vocabulary and non-verbalized punctuation. The first test set is the November 1992 ARPA evaluation 5k test set. It has 330 sentences collected from 8 speakers. The second test set is referred to as si_dt_05.odd in [15]. It is a subset of the WSJ1 5k development test set defined by first deleting sentences with out of vocabulary (OOV) words (relative to the 5k closed vocabulary) and then selecting every other sentence. This results in 248 sentences collected from 10 speakers. Together, these two test sets amount to about an hour of non-silence data. We test using the standard 5k bigram language model created at Lincoln Labs for the 1992 ARPA evaluation. The combined WER on these test sets using the models described above is 18%.

^See [n] for details. 

When we refer to MMI training, we mean lattice-based extended Baum-Welch as described in [11] or [16]. We use HTK 3.4 to perform extended Baum-Welch with standard settings, e.g., E = 1. We update the means and variances, but do not update the transition probabilities. We use a VST tool and a relatively weak bigram language model estimated from the acoustic training sentences (we kept bigrams that had 8 or more examples) to generate word lattices on the training set. We use HTK tools to create phone-marked numerator and denominator word lattices, the latter starting from the word lattices described above. The language model scores in the phone-marked word lattices come from the weak bigram language model that was used to generate the word lattices.

For each feature we create a variance floor set to 1% of the total variance of 
that feature in the training data. These floors are respected during maximum 
likelihood and MMI training. 
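
A minimal sketch of this flooring rule, assuming the total variance of a feature is its variance over all training frames; the function names and toy data are hypothetical and this is not the HTK implementation.

```python
def variance_floors(frames, fraction=0.01):
    """Per-feature floor: `fraction` of that feature's variance over all frames."""
    n, dim = len(frames), len(frames[0])
    floors = []
    for i in range(dim):
        mean = sum(f[i] for f in frames) / n
        var = sum((f[i] - mean) ** 2 for f in frames) / n
        floors.append(fraction * var)
    return floors

def apply_floor(variances, floors):
    """Clamp each re-estimated variance from below by its floor."""
    return [max(v, fl) for v, fl in zip(variances, floors)]

# Toy data: two frames of a two-dimensional feature.
frames = [[0.0, 10.0], [2.0, 30.0]]
floors = variance_floors(frames)             # variances are 1.0 and 100.0
floored = apply_floor([0.001, 5.0], floors)  # first entry is raised to its floor
```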

In all of our experiments, $\kappa = 16$. The HTK extended Baum-Welch software reports the per frame average of three quantities, namely, the numerator and denominator log-likelihoods, as well as their difference, i.e., the logarithm of the approximate MMI criterion. When we report results we too shall report per frame averages of these quantities.

2.3 A remark concerning $\Theta$

Since we are only updating the model means and variances during MMI we shall think of the model parameter space $\Theta$ as consisting of just the space of model means and variances. Even with the small 1500-state unimodal models that we are using, $\Theta$ is still quite large: on the order of $10^5$ dimensional. In general, if we define $\mathcal{H}_{39} = \{ \sigma^2 \in \mathbb{R}^{39} : \sigma_i^2 > 0 \text{ for } 1 \le i \le 39 \}$, then the variance and mean for state $j$, $(\sigma_j^2, \mu_j)$, range over the product $\mathcal{H}_{39} \times \mathbb{R}^{39}$, so

$$ \Theta = \prod_{j=1}^{1500} \left( \mathcal{H}_{39} \times \mathbb{R}^{39} \right) $$

which is an open set. But as a practical matter, we are flooring the variances: if we let $y \in \mathbb{R}^{39}$ be the variance floor, i.e. for each $i$ with $1 \le i \le 39$ we have $y_i > 0$, and we let $\overline{\mathcal{H}}_{39} = \{ \sigma^2 \in \mathbb{R}^{39} : \sigma_i^2 \ge 0 \text{ for } 1 \le i \le 39 \}$ be the closure of $\mathcal{H}_{39}$, then in actuality

$$ \Theta = \prod_{j=1}^{1500} \left( (\overline{\mathcal{H}}_{39} + y) \times \mathbb{R}^{39} \right) $$

which is a closed set.

At several points in the paper it will be useful to refer to the distance between model parameters. We will use the usual Euclidean distance on $\Theta$, denoted $\| \cdot \|$, to measure these distances.
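
The size of the parameter space and the distance just introduced can be made concrete with a short calculation. Representing a parameter vector as a flat list of numbers is an assumption made for the sketch.

```python
import math

N_STATES, DIM = 1500, 39

# Each tied state contributes 39 variances and 39 means.
n_params = N_STATES * (DIM + DIM)

def distance(theta_a, theta_b):
    """Euclidean distance between two flattened parameter vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_a, theta_b)))

example = distance([0.0, 0.0], [3.0, 4.0])
```

Here `n_params` is 117000, which is the sense in which the parameter space is on the order of $10^5$ dimensional.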

3 Experimental results 

All of the experiments described in this section start from the mle, $\theta_{\mathrm{mle}}$, then construct a sequence of model parameters $(\theta_k)$ by iteratively applying the extended Baum-Welch algorithm 100 times. The experiments differ in what lattices each iteration of extended Baum-Welch uses. In the first experiment we shall follow the standard procedure in the literature that uses lattices generated by the mle for each iteration of extended Baum-Welch. This corresponds to using $p_\theta(w \mid x; \kappa, \mathcal{V}_{\mathrm{mle}}, \mathcal{R}_{\mathrm{mle}})$ as the model selection criterion during each iteration of extended Baum-Welch. In the second experiment we shall regenerate the lattices between each iteration of extended Baum-Welch. This corresponds to using $p_\theta(w \mid x; \kappa, \mathcal{V}_{\theta_k}, \mathcal{R}_{\theta_k})$ as the model selection criterion during iteration $k + 1$ of extended Baum-Welch. In the third and final experiment we shall use the word lattices generated by the mle but we shall regenerate the phone-marks between each iteration of extended Baum-Welch. This corresponds to using $p_\theta(w \mid x; \kappa, \mathcal{V}_{\mathrm{mle}}, \mathcal{R}_{\theta_k})$ as the model selection criterion during iteration $k + 1$ of extended Baum-Welch.
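
The three experimental schedules differ only in which lattice components are refreshed between iterations, which the following sketch makes explicit. All of the callables (`make_words`, `make_marks`, `ebw_step`) are hypothetical placeholders supplied by the caller; nothing here corresponds to an actual HTK or VST interface.

```python
def train(theta_mle, n_iters, make_words, make_marks, ebw_step,
          refresh_words, refresh_marks):
    """Run n_iters of a caller-supplied update, refreshing lattices as requested.

    make_words(theta)         -> word lattices generated with models theta
    make_marks(theta, words)  -> phone-marks for those word lattices
    ebw_step(theta, words, marks) -> next parameter estimate
    """
    theta = theta_mle
    words = make_words(theta)
    marks = make_marks(theta, words)
    history = [theta]
    for k in range(n_iters):
        if k > 0 and refresh_words:
            words = make_words(theta)
        if k > 0 and (refresh_words or refresh_marks):
            marks = make_marks(theta, words)   # marks depend on the word lattices
        theta = ebw_step(theta, words, marks)
        history.append(theta)
    return history

def demo(refresh_words, refresh_marks, n_iters=3):
    """Record which iteration's models produced the lattices used at each step."""
    used = []
    def ebw_step(theta, words, marks):
        used.append((words[1], marks[1]))
        return theta + 1                       # stand-in for a real update
    train(0, n_iters, lambda th: ("V", th), lambda th, w: ("R", th),
          ebw_step, refresh_words, refresh_marks)
    return used

fixed = demo(False, False)        # first experiment: lattices from the mle only
regenerated = demo(True, True)    # second experiment: everything regenerated
marks_only = demo(False, True)    # third experiment: only phone-marks regenerated
```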

3.1 Fixed lattices 

In this experiment we generate the lattices once and for all using the mle, then run 100 iterations of extended Baum-Welch. Figure 2 in the introduction shows the results of this experiment, which we redisplay for the reader's convenience in Figure 4. It is worth noting that 14 iterations of extended Baum-Welch reduce the WER from 17.7% at the mle to 11.6%. This is a remarkable reduction in WER, which illustrates the utility of lattice-based MMI. Unfortunately, after 20 iterations the WER starts to steadily increase on the test data. Indeed, after 60 iterations the WER has exceeded that of the mle, and after 100 iterations the WER is 42.4%, which is nearly 2.5 times worse than the WER of the mle. This is in spite of the fact that the approximate MMI criterion is steadily increasing during these 100 iterations. The approximate MMI criterion that is displayed in Figure 4 is the sequence
$$ \left( \log p_{\theta_k}(w \mid x; \kappa, \mathcal{V}_{\mathrm{mle}}, \mathcal{R}_{\mathrm{mle}}) \right). $$

Figure 4: The WER on an independent test set and the MMI criterion on the training data during 100 iterations of extended Baum-Welch. The x-axis gives the extended Baum-Welch iteration, with x=0 being the mle.
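
The WER figures quoted for this experiment can be put in relative terms with a few lines of arithmetic. The helper name is an assumption; the three WER values are the ones reported in the text.

```python
def relative_change(baseline, value):
    """Signed change of `value` relative to `baseline`."""
    return (value - baseline) / baseline

wer_mle, wer_best, wer_final = 17.7, 11.6, 42.4

best_relative = 100 * relative_change(wer_mle, wer_best)   # about -34.5 (percent)
final_ratio = wer_final / wer_mle                          # about 2.4
```

In other words, the best point of the run is roughly a 34% relative reduction in WER, while the end of the run is roughly 2.4 times the WER of the mle.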

Note that during the first 14 iterations the WER steadily decreases, in synchrony with the steady increase in the approximate MMI criterion. By construction the approximation

$$ p_\theta(w \mid x; \kappa) \approx p_\theta(w \mid x; \kappa, \mathcal{V}_{\mathrm{mle}}, \mathcal{R}_{\mathrm{mle}}) \tag{3.1} $$

should be quite reasonable for model parameters $\theta$ near the mle, $\theta_{\mathrm{mle}}$. Our hypothesis is that as extended Baum-Welch proceeds it gradually pushes the sequence $(\theta_k)$ away from the mle into a region where the approximation (3.1) is no longer valid.

We supply further evidence for this hypothesis by running recognition on the training data using the sequence of model parameters produced by extended Baum-Welch, $(\theta_k)$, and two different recognition methods. To perform recognition using the models $\theta_k$ in the first method, which we refer to as 'method A', we first generate the phone-marked training lattices using $\theta_k$. We then pick the best path through the phone-marked word lattices, which corresponds to choosing a transcription $w_k \in \mathcal{W}$ according to the rule

$$ w_k = \arg\max_{w'} \; p_{\theta_k}(w' \mid x; \kappa, \mathcal{V}_{\theta_k}, \mathcal{R}_{\theta_k}). $$



Since, as we have remarked in Section 2.1, the distributions $p_{\theta_k}(\cdot \mid x; \kappa, \mathcal{V}_{\theta_k}, \mathcal{R}_{\theta_k})$ and $p_{\theta_k}(\cdot \mid x; \kappa)$ should be essentially the same, this amounts to constructing a sequence of recognition transcriptions $(w_k)$ according to the rule

$$ w_k = \arg\max_{w'} \; p_{\theta_k}(w' \mid x; \kappa). \tag{3.2} $$

In the second method, which we refer to as 'method B', we first rescore the phone-marked word lattices generated by the mle - the lattices that we used during extended Baum-Welch - with the acoustic model with parameters $\theta_k$, then pick the best path through the resulting lattices. This amounts to constructing a sequence of recognition transcriptions $(v_k)$ in $\mathcal{V}_{\mathrm{mle}}$ according to the rule

$$ v_k = \arg\max_{v} \; p_{\theta_k}(v \mid x; \kappa, \mathcal{V}_{\mathrm{mle}}, \mathcal{R}_{\mathrm{mle}}). \tag{3.3} $$

Table 1 displays the results of these recognitions as extended Baum-Welch proceeds. The pattern of the WER of the recognition of the training data using method A is very similar to the pattern that we saw on the independent test data, but these patterns are very different from the pattern of the WER of the recognition of the training data using method B. In particular, the sequence of WERs generated by method B steadily decreases, apparently headed towards 0% WER.

As extended Baum-Welch proceeds the probability of the reference training transcription, $p_{\theta_k}(w \mid x; \kappa, \mathcal{V}_{\mathrm{mle}}, \mathcal{R}_{\mathrm{mle}})$, steadily increases, while the sequence of transcriptions recognized by the rule (3.3), $(v_k)$, has steadily decreasing WER. Since the WER is steadily decreasing, the sequence of recognized transcriptions, $(v_k)$, is moving closer to the reference transcription $w$. It follows that extended Baum-Welch is making the reference transcription, or a transcription very close to it, the most likely transcription under the rule (3.3), which uses the approximation $p_{\theta_k}(\cdot \mid x; \kappa, \mathcal{V}_{\mathrm{mle}}, \mathcal{R}_{\mathrm{mle}})$ for $k \gg 0$.



In contrast, the WER of the sequence of transcriptions recognized via the rule (3.2), $(w_k)$, is steadily increasing after iteration 30, which means that this sequence of recognized transcriptions must be steadily moving further away from the reference transcription $w$. Since the probability distributions $p_{\theta_k}(\cdot \mid x; \kappa, \mathcal{V}_{\mathrm{mle}}, \mathcal{R}_{\mathrm{mle}})$ and $p_{\theta_k}(\cdot \mid x; \kappa)$ are making drastically different choices under their respective recognition rules for large $k$, it follows that these probability distributions must be very different after many iterations of extended Baum-Welch.

iteration   Method A   Method B
    0         28.3       28.3
   30         14.9        9.3
   40         15.2        8.1
   50         15.6        7.2
   60         19.3        6.1
   70         23.8        5.2
   80         33.2        4.2
   90         45.7        3.2
  100         57.3        2.6

Table 1: WER (%) on training data during 100 iterations of extended Baum-Welch using two different recognition methods.
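
How the two recognition rules can disagree is easy to reproduce on a toy example. The scores below are invented for the illustration and play the role of the logarithm of the scaled acoustic likelihood times the language model probability; the toy also ignores the phone-mark restriction and only models the restriction to a fixed hypothesis set.

```python
def recognize(score, candidates):
    """Pick the highest-scoring hypothesis among `candidates`."""
    return max(candidates, key=lambda w: score[w])

# Hypothetical scores at the seed models and after many iterations.
score_mle = {"ref": -100.0, "near": -101.0, "far": -130.0}
score_k = {"ref": -140.0, "near": -150.0, "far": -135.0}

# Fixed lattice: hypotheses that survived pruning at the seed models, plus the reference.
lattice = [w for w in score_mle if score_mle[w] > -110.0]
if "ref" not in lattice:
    lattice.append("ref")

method_a = recognize(score_k, list(score_k))   # search not confined to the old lattice
method_b = recognize(score_k, lattice)         # rescoring of the fixed lattice
```

In this toy case `method_b` returns the reference while `method_a` returns a hypothesis that was pruned at the seed models, mirroring the divergence between the two columns of Table 1.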



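Why does the approximation in (2.8) break down for $k \gg 0$? To answer this, we examine the sequence of numerator log-likelihoods

$$ \left( \mathrm{num}(x, w; \theta_k, \kappa, \mathcal{V}_{\mathrm{mle}}, \mathcal{R}_{\mathrm{mle}}) \right) $$

and the sequence of denominator log-likelihoods

$$ \left( \mathrm{den}(x; \theta_k, \kappa, \mathcal{V}_{\mathrm{mle}}, \mathcal{R}_{\mathrm{mle}}) \right) $$

over the 100 iterations in Figure 5. Since the sequence of numerator log-likelihoods is steadily decreasing, the sequence of parameters $(\theta_k)$ must be steadily moving away from $\theta_{\mathrm{mle}}$. The sequence of denominator log-likelihoods is not only steadily decreasing but is also moving closer to the sequence of numerator log-likelihoods. From the definitions of the numerator and denominator log-likelihoods, it follows that

$$ g_{\theta_k}(x \mid v; \mathcal{R}_{\mathrm{mle}})^{1/\kappa}\, p(v) \ll g_{\theta_k}(x \mid w; \mathcal{R}_{\mathrm{mle}})^{1/\kappa}\, p(w) \tag{3.4} $$

for large $k$ and every transcription $v \in \mathcal{V}_{\mathrm{mle}}$ different from the reference transcription $w$. This property of $g_{\theta_k}$ and $\mathcal{V}_{\mathrm{mle}}$ is not shared by $f_{\theta_k}$ and $\mathcal{W}$. We have already shown that for large $k$ the transcript $w_k \in \mathcal{W}$ recognized by rule (3.2) is very different from $w$, hence the analog of (3.4) does not hold for $f_{\theta_k}$ and $\mathcal{W}$ since $w_k \neq w$ and

$$ f_{\theta_k}(x \mid w_k)^{1/\kappa}\, p(w_k) \ge f_{\theta_k}(x \mid w)^{1/\kappa}\, p(w). \tag{3.5} $$

We conclude that the probability distributions $p_{\theta_k}(\cdot \mid x; \kappa, \mathcal{V}_{\mathrm{mle}}, \mathcal{R}_{\mathrm{mle}})$ and $p_{\theta_k}(\cdot \mid x; \kappa)$ are different because restricting the relevant sums in the definition of $p_{\theta_k}(\cdot \mid x; \kappa, \mathcal{V}_{\mathrm{mle}}, \mathcal{R}_{\mathrm{mle}})$ to $\mathcal{R}_{\mathrm{mle}}$ and $\mathcal{V}_{\mathrm{mle}}$ leads to ignoring state sequences in $\mathcal{S} \setminus \mathcal{R}_{\mathrm{mle}}$ or transcriptions in $\mathcal{W} \setminus \mathcal{V}_{\mathrm{mle}}$ that have zero probability under $\theta_{\mathrm{mle}}$ but non-trivial probability under $\theta_k$.

Figure 5: Numerator and denominator log-likelihoods during 100 iterations of extended Baum-Welch.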

However, the situation is actually much worse than this analysis suggests.
To see this we turn to investigating the convergence properties of $\{\theta_k\}$. As
we noted in the introduction, Figure 4 shows that the sequence of model
parameters $\{\theta_k\}$ cannot have converged during these 100 iterations, since the
sequence $\{\log p_{\theta_k}(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})\}$ has clearly not converged. If the trend
exhibited in Figure 5 continues, then

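\[ \lim_{k \to \infty} \mathrm{num}(x, w; \theta_k, \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle}) = \lim_{k \to \infty} \mathrm{den}(x, w; \theta_k, \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle}) = -\infty. \tag{3.6} \]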

If (3.6) were true, then the continuity of $\mathrm{num}(x, w; \theta, \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$ as a function
of $\theta$ would show that not only is the sequence $\{\theta_k\}$ not convergent, but
that it does not remain within any given compact set. Furthermore, as we
remarked in Section 2, because we are flooring the variances $\Theta$ is in fact
closed, so it would follow that the sequence $\{\theta_k\}$ must be unbounded. More
precisely, the sequence of model parameters $\{\theta_k\}$ created by 100 iterations of
extended Baum-Welch is consistent with the following conjecture.^9



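Conjecture 3.1. Let $\{\theta_k\} \subset \Theta$ be the sequence of model parameters defined
inductively by $\theta_0 = \theta_{mle}$, and $\theta_{k+1}$ is obtained from $\theta_k$ by using extended
Baum-Welch with $p_\theta(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$ as the model selection criterion.
Then

\[ \lim_{k \to \infty} \|\theta_k\| = \infty \tag{3.7} \]

and

\[ \lim_{k \to \infty} \log p_{\theta_k}(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle}) = 0. \tag{3.8} \]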



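Conjecture 3.1 suggests that the approximate MMI criterion,
$p_\theta(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$, is a terrible choice for a model selection criterion. This
is because the problem

\[ \hat{\theta} = \arg\max_{\theta \in \Theta} p_\theta(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle}) \tag{3.9} \]

admits an absurd solution, namely a point at infinity with

\[ p_{\hat{\theta}}(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle}) = 1. \]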



^9 The observed conditional log-likelihood of the training data does change dramatically
after 100 iterations of extended Baum-Welch, in fact decreasing in absolute value by a
factor of 14, which means that the relationship between the conditional likelihoods before
and after 100 iterations of extended Baum-Welch is given by

\[ \log p_{\theta_{100}}(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle}) = \frac{1}{14} \log p_{\theta_{mle}}(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle}). \]

Thus the probability mass is considerably more concentrated on $w$ under $\theta_{100}$ than it was
to begin with under $\theta_{mle}$.
In other words, this would mean that the model selection problem in (3.9)
is ill-posed. On the other hand, the algorithm, extended Baum-Welch,
appears to be blameless since it is doing what the apparently ill-posed problem
requires it to do.
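A one-parameter toy problem shows the same kind of ill-posedness. The sketch below is an illustration only, not the paper's model: it assumes a criterion of the form $\sigma(\theta) = 1/(1 + e^{-\theta})$, which is what a restricted posterior looks like when the lattice holds the reference and a single competitor, and it uses plain gradient ascent rather than extended Baum-Welch.

```python
import math

def log_criterion(theta):
    """log sigma(theta) = -log(1 + exp(-theta)), computed stably."""
    return -math.log1p(math.exp(-theta))

def ascend(theta0=0.0, step=1.0, iterations=100):
    """Gradient ascent on log sigma(theta); returns parameter and value histories."""
    theta = theta0
    thetas, values = [theta], [log_criterion(theta)]
    for _ in range(iterations):
        grad = 1.0 / (1.0 + math.exp(theta))  # d/dtheta log sigma(theta) = sigma(-theta)
        theta += step * grad
        thetas.append(theta)
        values.append(log_criterion(theta))
    return thetas, values

thetas, values = ascend()
```

Every step increases both the parameter and the criterion, and the log criterion approaches its supremum of 0 without ever attaining it. The optimizer behaves correctly; the maximizer simply lies at infinity.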

One possible mechanism that extended Baum-Welch could use to drive
the numerator and denominator log-likelihoods to big - but not infinite -
negative values is to simply push a large number of the model variances to
the variance floor. We rule this mechanism out by observing that none of
the mle model parameters have floored variances, while after 100 iterations
of extended Baum-Welch only 10 out of the approximately 59,000 variances
have been floored. In Appendix A we give a more detailed analysis of what
happens to the model parameters after 100 passes of extended Baum-Welch.
In particular, we show that there is a significant expansion of the space that
the means occupy.

Finally, as we noted in Section 2.2, the extended Baum-Welch parameter
$E$ was set to 1 in this experiment. It is natural to ask if larger values of
$E$ would produce different results. In particular, we might speculate that
it is possible to choose $E$ large enough to guarantee convergence of the
resulting sequence of model parameters $\{\theta_k\}$. In Appendix B we examine this
question. There we find that increasing $E$ only slows the behavior that we
have observed in this section, with the fundamental problem of parameter
divergence remaining with larger choices for $E$. However, this result should
not be surprising since it is consistent with the results of this section: the
parameter divergence, among the other deficiencies of lattice-based MMI, is
due to properties of the approximate MMI criterion and not due to properties
of extended Baum-Welch.



3.2 Regenerated lattices: word and phone-marks 

In this experiment we regenerate the phone-marked word lattices between
each iteration of extended Baum-Welch. Thus at iteration $k + 1$ we are using
$p_\theta(w \mid x; \kappa, \mathcal{V}_{\theta_k}, \mathcal{R}_{\theta_k})$ as the model selection criterion when we estimate the
parameters $\theta_{k+1}$. Starting from the mle, we run extended Baum-Welch 100
times. We observe the following:

(1) The logarithm of the approximate MMI criterion, $\log p_{\theta_k}(w \mid x; \kappa, \mathcal{V}_{\theta_k}, \mathcal{R}_{\theta_k})$,
increases for the first 30 iterations and then starts to oscillate. It
reaches its peak value after 50 iterations and then continues its
oscillation. See Figure 6.

(2) The WER on the test set steadily decreases for the first 25 iterations,
then continues to decrease but with oscillation. After 36 iterations
the WER reaches its minimum value, 10.1%, and then oscillates from
10.1% to various values ranging from 10.4% to 10.9%. See Figure 6.

(3) When we measure the WER on the training data to three significant
figures it steadily decreases and in fact appears to converge to
12.3%. See Table 2. However, when we measure this WER to four
significant digits, there is a small oscillation of ±0.05%.

iteration   WER (%)
mle         28.3
10          17.4
20          -
30          13.3
40          12.7
50          12.5
60          12.5
70          12.4
80          12.3
90          12.3
100         12.3

Table 2: WER on the training set during 100 iterations of extended
Baum-Welch.

Using a matched pairs test ([4]), the minimum WER on the test set in this
experiment, 10.1%, is significantly better than that of the previous
experiment, 11.6%, with a p-value below 0.001 (the smallest non-zero p-value
that we can detect with our software is 0.001). But the relative reductions
in WER, 17.7 $\to$ 11.6 (34%) versus 17.7 $\to$ 10.1 (43%),
are similar. Perhaps the improvement that we are seeing is an artifact of the
inherent noise in WERs obtained using simple models on small test sets.
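The two relative reductions quoted above follow directly from the WERs reported in the text. The helper name `relative_reduction` below is introduced only for this check.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction, in percent, with respect to the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# 17.7% is the mle test set WER; 11.6% and 10.1% are the minima with
# fixed lattices and with fully regenerated lattices, respectively.
fixed = relative_reduction(17.7, 11.6)
regenerated = relative_reduction(17.7, 10.1)
```

Rounded to whole percentages, these give the 34% and 43% figures used in the comparison.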
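For model parameters $\theta$ near $\theta_k$ the approximation

\[ p_\theta(w \mid x; \kappa) \approx p_\theta(w \mid x; \kappa, \mathcal{V}_{\theta_k}, \mathcal{R}_{\theta_k}) \tag{3.10} \]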
Figure 6: MMI criterion on the training set during 100 iterations of extended
Baum-Welch with lattice regeneration. The x-axis gives the extended
Baum-Welch iteration, with x=0 being the mle.
should be very good. So by regenerating the phone-marked word lattices
between each iteration of extended Baum-Welch, we are effectively using
something very close to $p_\theta(w \mid x; \kappa)$ as our model selection criterion for each
iteration of extended Baum-Welch. Figure 6 shows that while the sequence
$\{p_{\theta_k}(w \mid x; \kappa)\}$ is oscillating for $k$ large, it is not being driven to 1. In other
words, the model selection criterion $p_\theta(w \mid x; \kappa)$ is not exhibiting the
pathological behavior that $p_\theta(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$ exhibits. Also, the sequences
of word error rates on the independent test set and the training data behave
similarly as extended Baum-Welch proceeds and are consistent with the
sequence $\{p_{\theta_k}(w \mid x; \kappa)\}$. In particular we do not observe any phenomena that might
be labeled 'over training' or 'over fitting'.

Finally, even though we are seeing much more benign behavior than in
the previous experiment with fixed lattices, neither the approximate MMI
criterion nor the sequence of underlying parameters, $\{\theta_k\}$, has converged
after 100 iterations of extended Baum-Welch. However, in
contrast to the previous experiment with fixed lattices, Figure 6 suggests
that in the limit the sequence $\{\theta_k\}$ is orbiting a compact limit set.



3.3 Regenerated lattices: only phone-marks 

In this experiment we generate the word lattices once and for all using the
mle, but we regenerate the phone-marks in the lattices between each
iteration of extended Baum-Welch. Thus at iteration $k + 1$ we are using
$p_\theta(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{\theta_k})$ as the model selection criterion when we estimate the
parameters $\theta_{k+1}$. Starting from the mle, we run extended Baum-Welch 100
times. The results we observe are a less extreme version of what we saw in
Section 3.1, namely the approximate MMI criterion steadily increases, while
the WER on the independent test set steadily decreases until it reaches its
minimum of 11.5% at iteration 15, levels out between iterations 16 and 22,
and then starts to increase gradually. Significant oscillations in
the WER begin after iteration 47. See Figure 7. So just as in the experiment
with fixed lattices in Section 3.1, we observe a disconnect between the
steadily increasing approximate MMI criterion and the increasing test set
WER.



Following the example set in Section 3.1, we run recognition on
the training data using two different recognition methods. The first method
is exactly the same as method A in Section 3.1, namely we use the
distribution $p_{\theta_k}(\cdot \mid x; \kappa)$ to select our recognition transcription at iteration $k$ via
the rule (3.2). The second method, which we refer to as 'method C', is
analogous to method B in Section 3.1 except that here we use the distribution
$p_{\theta_k}(\cdot \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{\theta_k})$ to select the recognition transcription instead of the
distribution $p_{\theta_k}(\cdot \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$ in the rule (3.3). The resulting
recognition results are displayed in Table 3. Notice that both sequences of WERs are
slowly decreasing over this range. An argument analogous to what we
presented in Section 3.1 shows that the probability distributions $p_{\theta_k}(\cdot \mid x; \kappa)$
and $p_{\theta_k}(\cdot \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{\theta_k})$ must be very different for $k \gg 0$. The only
possible explanation for why these distributions are different is that restricting
the relevant sum in the definition of $p_{\theta_k}(\cdot \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{\theta_k})$ to $\mathcal{V}_{mle}$
ignores transcriptions in $\mathcal{W} \setminus \mathcal{V}_{mle}$ that have zero probability under $\theta_{mle}$ but
non-trivial probability under $\theta_k$.

Figure 7: MMI criterion on the training data during 100 iterations of
extended Baum-Welch with lattice regeneration. The x-axis gives the extended
Baum-Welch iteration, with x=0 being the mle.

Table 3: WER on training data during 100 iterations of extended
Baum-Welch using two different recognition methods.

While the WER on the independent test set is increasing in Figure 7
after iteration 22, the increase is very gradual and the oscillation in the
WER is large. We decided to run 290 more iterations of extended Baum-Welch
to see if the minimum value in the oscillating WER ever exceeds that
of the mle, namely 17.7%. Figure 8 displays the results of the resulting 390
iterations from the beginning. At iterations 337-341 the oscillation in the
WER temporarily stops, with the WER at 18.2%.

In Figure 8 we see that the approximate MMI criterion begins to oscillate
after about iteration 100. The test set WER also begins a secondary
oscillation at about that iteration. This is easier to see in Figure 9. Note that even
though the overall trend in approximate MMI criterion and test set WER is
upward until about iteration 300, which means that there is a disconnect in
the approximate MMI criterion from the test set WER, the oscillations in
the approximate MMI criterion are connected to the secondary oscillations
in the test set WER. For example at iteration 371 the approximate MMI
criterion is at the bottom of a steep valley, while the test set WER is at the
top of a peak (19.2%).

Figure 8: WER on the test set during 100 iterations of extended
Baum-Welch with lattice regeneration. The x-axis gives the extended Baum-Welch
iteration, with x=0 being the mle.

Finally we note that while the overall trends in both the approximate MMI
criterion and the WER appear to be converging - both trends start to level
out at around iteration 300 - the amplitudes of the secondary oscillations
appear to be increasing. Thus, even after nearly 400 iterations, the sequence
of model parameters is not close to converging. Also, because the secondary
oscillations are increasing in amplitude, it is possible that the sequence of
model parameters is heading to a point at infinity, but more extended
Baum-Welch iterations and further analysis would be required to settle this point.
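One simple way to make "the amplitudes of the oscillations appear to be increasing" quantitative is to compare the peak-to-peak range of a sequence over consecutive windows. The sketch below is only an illustration of that idea: the series is synthetic (a flat level plus an oscillation with linearly growing amplitude), not the WERs from this experiment, and the window length of 50 iterations is an arbitrary assumption.

```python
import math

def window_amplitudes(values, window):
    """Peak-to-peak range of each consecutive, non-overlapping window."""
    return [
        max(values[i:i + window]) - min(values[i:i + window])
        for i in range(0, len(values) - window + 1, window)
    ]

# Synthetic stand-in for a WER-like sequence over 400 iterations.
series = [18.0 + 0.002 * k * math.sin(0.5 * k) for k in range(400)]

amps = window_amplitudes(series, window=50)
growing = all(b > a for a, b in zip(amps, amps[1:]))
```

A sequence whose windowed amplitudes keep growing, as they do for this synthetic series, cannot be close to converging, whatever its overall trend does.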



3.4 Further analysis of the three experiments 

In this section we present an analysis that shows that if we use
$p_\theta(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$ as our approximation to the MMI criterion, then

(1) For the first 9 iterations of extended Baum-Welch, this approximate
MMI criterion consistently overestimates the value of its choice of model
parameter relative to the value that the true MMI criterion would place
on it. We express this mathematically, in the range $0 \le k \le 8$, by:

\[ p_{\theta_{k+1}}(w \mid x; \kappa) < p_{\theta_{k+1}}(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle}). \]

(2) This overestimation is solely due to ignoring alternative transcriptions
that the mle placed zero probability on.

(3) Significant errors in this approximation due to ignoring alternative state
sequences, that the mle placed zero probability on, do not occur until
after iteration 27.



Figure 9: WER on the test set during 100 iterations of extended
Baum-Welch with lattice regeneration. The x-axis gives the extended Baum-Welch
iteration, with x=0 being the mle.

Figure 10 compares the WERs on the independent test set as extended
Baum-Welch proceeds using the three approximate MMI criteria:
$p_\theta(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$ which is labeled 'Fixed lattices',
$p_\theta(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{\theta_k})$ which
is labeled 'Regenerate phone-marks', and $p_\theta(w \mid x; \kappa, \mathcal{V}_{\theta_k}, \mathcal{R}_{\theta_k})$ which is labeled
'Regenerate all'. These WERs are all essentially the same until iteration 10,
when the 'Regenerate all' WERs separate from the other two. We infer
from this that the corresponding sequences of model parameters follow the
same pattern: all three sequences of model parameters are essentially the
same until iteration 10, when the sequence created using the model selection
criterion $p_\theta(w \mid x; \kappa, \mathcal{V}_{\theta_k}, \mathcal{R}_{\theta_k})$ separates from the other two sequences.

Figure 10: Comparison of WERs on the independent test set during 100
iterations of extended Baum-Welch with the three types of input lattices.
The x-axis gives the extended Baum-Welch iteration, with x=0 being the
mle.



Similarly, Figure 11 compares the values of the three approximate MMI
criteria as extended Baum-Welch proceeds. For the first nine iterations all
three sequences of model parameters are the same, but in the range $0 \le k \le 8$

\[ p_{\theta_{k+1}}(w \mid x; \kappa, \mathcal{V}_{\theta_k}, \mathcal{R}_{\theta_k}) < p_{\theta_{k+1}}(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{\theta_k}) = p_{\theta_{k+1}}(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle}). \tag{3.11} \]

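The definitions show that the inequality in (3.11) must be due to alternate
training transcriptions $w \in \mathcal{W} \setminus \mathcal{V}_{mle}$ satisfying

and, for $k$ with $0 \le k \le 8$,

\[ q_{\theta_{k+1}}(x \mid w; \mathcal{R}_{\theta_k}) = q_{\theta_{k+1}}(x \mid w; \mathcal{R}_{mle}) > 0. \]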

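Recall that the approximate MMI criterion $p_{\theta_{k+1}}(w \mid x; \kappa, \mathcal{V}_{\theta_k}, \mathcal{R}_{\theta_k})$ is
essentially the same as $p_{\theta_{k+1}}(w \mid x; \kappa)$. However, we have shown that the
approximate MMI criteria $p_{\theta_{k+1}}(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{\theta_k})$ and $p_{\theta_{k+1}}(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$ are
overly optimistic relative to $p_{\theta_{k+1}}(w \mid x; \kappa)$ solely because the corresponding
distributions place zero probability on transcriptions that are not in $\mathcal{V}_{mle}$ but
have non-zero probability under $p_{\theta_{k+1}}(\cdot \mid x; \kappa)$.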

Figures 10 and 11 show that the test set WERs and the approximate
MMI criteria are very similar for the first 27 iterations when we use the
approximations $p_{\theta_{k+1}}(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$ and $p_{\theta_{k+1}}(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{\theta_k})$. Thus
a similar argument shows that it is not until after iteration 28 that
ignoring state sequences in $\mathcal{S} \setminus \mathcal{R}_{mle}$ starts to degrade the approximation in
$p_{\theta_{k+1}}(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$.

Finally, Figures 10 and 11 also make clear how important regenerating
the phone-marks is to good asymptotic behavior for lattice-based MMI.



4 Discussion 



We showed in Section 3.2 that if we iteratively estimate model parameters
using extended Baum-Welch and an approximate MMI criterion -
$p_\theta(w \mid x; \kappa, \mathcal{V}_{\theta_k}, \mathcal{R}_{\theta_k})$ at iteration $k + 1$ - that is a consistently good approximation
to $p_\theta(w \mid x; \kappa)$, then the resulting behavior is nearly ideal. The WERs on the
training and test data decline while the approximate MMI criterion rises. In
fact the minimum WER on the test set is lower - albeit by a small amount -
than when we use the standard approximate MMI criterion. The sequences
of WERs and approximate MMI criteria do not appear to actually converge,
but instead oscillate within a small range of values, which suggests that
the underlying parameters will asymptotically orbit a compact limit set. In
particular, there is no evidence that anything like 'over fitting' is occurring.
On the contrary, this suggests that parameter estimation using the ideal MMI
criterion, $p_\theta(w \mid x; \kappa)$, and the extended Baum-Welch algorithm should have
properties that are nearly as good as maximum likelihood estimation using
the Baum-Welch algorithm.

Figure 11: Comparison of the approximate MMI criterion during 100
iterations of extended Baum-Welch with the three types of input lattices. The
x-axis gives the extended Baum-Welch iteration, with x=0 being the mle.



In contrast, we showed in Section 3.1 that the properties of the standard
approximation to the MMI criterion, namely $p_\theta(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$, make
it unsuitable as a model selection criterion. First of all, when $\theta = \theta_{mle}$ the
approximation

\[ p_\theta(w \mid x; \kappa) \approx p_\theta(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle}) \tag{4.1} \]

is essentially an equality and this approximation remains good while $\theta$ is
close to $\theta_{mle}$. However, we showed that as $\theta$ moves away from $\theta_{mle}$ the two
distributions $p_\theta(\cdot \mid x; \kappa)$ and $p_\theta(\cdot \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$ become very different.
This is because, by construction, $p_\theta(\cdot \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$ ignores transcriptions
in $\mathcal{W} \setminus \mathcal{V}_{mle}$ and state sequences in $\mathcal{S} \setminus \mathcal{R}_{mle}$ that have zero probability under
our model at $\theta_{mle}$. As $\theta$ moves away from $\theta_{mle}$ these ignored transcriptions
and state sequences start to accumulate non-trivial probability mass, leading
this approximate MMI criterion to consistently overestimate $p_\theta(w \mid x; \kappa)$.



Interestingly, we showed in Section 3.3 and Section 3.4 that the ignored state
sequences appear to contribute just as much as the ignored transcriptions do
to the discrepancy between $p_\theta(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$ and $p_\theta(w \mid x; \kappa)$ when $\theta$
is far from $\theta_{mle}$. Second of all, and to make matters worse, we also showed
that the approximate MMI criterion $p_\theta(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$ appears to have
a maximum value of 1 (i.e., the approximate MMI criterion places all the
probability mass on the reference transcription $w$) when $\theta$ is at a point at
infinity. This means that estimating model parameters using

\[ \hat{\theta} = \arg\max_{\theta \in \Theta} p_\theta(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle}) \]

appears to be an ill-posed problem. This also means that using
$p_\theta(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$ as a model selection criterion will produce estimates $\hat{\theta}$ far
away from $\theta_{mle}$ where (4.1) is no longer a good approximation.



Our explanation for the behavior of lattice-based MMI is in terms of the
quality of the approximation (4.1) and the properties of $p_\theta(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$
as a model selection criterion. For the first several iterations of extended
Baum-Welch, the estimated parameters, $\theta_k$, stay close enough to $\theta_{mle}$ so that
the approximation (4.1) is still good. As we have shown, parameter
estimation using a model selection criterion close to $p_\theta(w \mid x; \kappa)$ is well behaved, and
the first several iterations of model estimation using the approximate MMI
criterion $p_\theta(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$ are well behaved too: WERs on the training
and test set steadily decline, while the approximate MMI criterion steadily
increases. However as the number of iterations of extended Baum-Welch
increases beyond 20, we have shown that the approximation (4.1) degrades
and the distributions $p_\theta(\cdot \mid x; \kappa)$ and $p_\theta(\cdot \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$ become
increasingly unrelated. The WER on the training data obtained by using the
distribution $p_{\theta_k}(\cdot \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$ (method B in Section 3.1) steadily
declines and the related approximate MMI criterion steadily increases, which
shows that extended Baum-Welch is correctly optimizing the approximate
MMI criterion. However, the distribution that we really care about is not
$p_{\theta_k}(\cdot \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$, but rather what is now the very different distribution
$p_{\theta_k}(\cdot \mid x; \kappa)$. The WER on the training data using the distribution $p_{\theta_k}(\cdot \mid x; \kappa)$
(method A in Section 3.1) begins to get steadily worse, as do WERs on the test
data. This is not due to 'over fitting', but is instead the natural consequence
of using an ill-behaved approximation.

We also believe that this is the explanation for the behavior that was
labeled 'over fitting' in all the earlier versions of MMI. Except in the case of
tasks like isolated digits, it has never been possible to sum over all possible
transcriptions or state sequences, which the definition of the earlier MMI
criterion, $p_\theta(w \mid x)$, also requires. Instead, all the early versions of MMI used
approximations to this MMI criterion, say $p_\theta(w \mid x; \mathcal{V}_{mle}, \mathcal{R}_{mle})$, analogous to
the approximation that we have studied in this paper. It would be very
surprising if the earlier approximations exhibited markedly different properties
than those that we have shown $p_\theta(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$ exhibits.

One approach to improving lattice-based MMI would be to come up with
a better approximation to $p_\theta(w \mid x; \kappa)$. There are two directions that one
could pursue. The first would be to come up with a better functional form for
the approximation itself. This could involve something as simple as adding
a regularization term to prevent the model parameters from straying too
far from the mle or a deeper analysis of what useful properties $p_\theta(w \mid x; \kappa)$
has that can be captured in practical form. The second would involve a
better, quantitative understanding of how close $\theta$ needs to be to $\theta_{mle}$ in order
for the approximation (4.1) to be valid. With that knowledge, instead of
regenerating the lattices between each iteration of extended Baum-Welch,
we could regenerate the lattices only as needed based on how far the model
parameters have moved and our estimate of how much (4.1) has deteriorated.
A related issue is that it would be extremely useful to have a quantitative
measure of how 'good' the lattices need to be for (4.1) to be valid. Although



pTj contains some useful guidance in this regard, this remains part of the 
'black art' of lattice-based MMI. 

Finally, we believe that this paper illustrates a fruitful area of research 
that is usually ignored in the speech recognition literature. For a variety 
of reasons, we tend to treat our estimation algorithms as 'black boxes'. A 
great deal of attention is paid to the external responses of the algorithm to 
stimuli in the form of training and test data. Especially important responses 
to these stimuli are computational efficiency and word error rate. In fact, we 
tend to develop algorithms expressly to lower the word error rate, often by 
tinkering with or combining extant algorithms. We also tend to prefer one 
algorithm over another based on these considerations, which leads to a form 
of word error rate and computational efficiency based natural selection in the 
evolution of algorithms. While this state of affairs is perfectly reasonable,
and indeed has led to remarkable progress in the field of speech recognition over the
last twenty years, what goes on inside the resulting rather intricate collection
of black boxes has rarely if ever been examined. For example, the models
and their parameters typically live deep inside the black box, which means 
that asymptotic properties of the algorithm at the model parameter level 
are also hidden from view. Another example is the overall effect of the 
many approximations that are used to make computations feasible. When 
taken individually these effects are well understood, but their overall effect 
in combination is often overlooked or even ignored when theoretical analysis 
is undertaken. Lattice-based MMI is a perfect example of one of these black 
boxes. It is a complicated algorithm with many facets that has evolved over 
more than twenty years, the evolution driven by the goal of lowering the 
word error rate. When considered in the light of word error rate reduction, 
lattice-based MMI is remarkably successful. However, this paper shows that 
when we peer deeper inside this particular black box we see that the model 
parameters are behaving pathologically. We were not only able to diagnose 
the problem - a flaw in a key approximation - but we were also able to 
correct it, which results in more stable model parameter behavior and a small 
improvement in the word error rate. We intend to continue investigations of 
this sort on other important algorithms used in speech recognition. 
A What happens to the model parameters?



This Section presents further analysis of the results in Section 3.1. We
will develop a simple framework that we will use to examine what happens
to the model parameters after 100 iterations of extended Baum-Welch.
We shall demonstrate that extended Baum-Welch expands the space that
the model means occupy and that this expansion appears to be consistent
with the steady decrease in the sequences $\{\mathrm{num}(x, w; \theta_k, \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})\}$ and
$\{\mathrm{den}(x, w; \theta_k, \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})\}$ that we saw in Section 3.1. We also show that
extended Baum-Welch appears to be shrinking the model variances, but not
in a dramatic way.

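We begin with a review of some useful properties of ellipsoids in $\mathbb{R}^d$. Let
$\Sigma$ be a $d \times d$ positive definite matrix. Then the inverse matrix $\Sigma^{-1}$ defines a
metric, $d_{\Sigma^{-1}}(\cdot, \cdot)$, on $\mathbb{R}^d$ via

\[ d_{\Sigma^{-1}}(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}. \]

This is called the Mahalanobis distance. The set of points in $\mathbb{R}^d$ satisfying
$d_{\Sigma^{-1}}(x, 0) = 1$ is an ellipsoid

\[ \mathrm{Ell}(\Sigma^{-1}) = \{x \in \mathbb{R}^d : d_{\Sigma^{-1}}(x, 0) = 1\}, \]

with volume given by

\[ \mathrm{Volume}(\mathrm{Ell}(\Sigma^{-1})) = \mathrm{Volume}(S^{d-1}) \sqrt{\det \Sigma}, \]

where $S^{d-1} = \{x \in \mathbb{R}^d : \|x\| = 1\}$. Let $\{\lambda_i\}_{i=1}^d$ and $\{u_i\}_{i=1}^d$ be the
eigenvalues and corresponding choices for orthonormal eigenvectors for $\Sigma$, where for
convenience we have indexed the eigenvalues so that

\[ \lambda_1 \le \lambda_2 \le \cdots \le \lambda_d. \]

Then the collection of vectors $\{\sqrt{\lambda_i}\, u_i\}_{i=1}^d$ are the semi-principal axes of
$\mathrm{Ell}(\Sigma^{-1})$. The ratio $c(\Sigma) = \sqrt{\lambda_d / \lambda_1}$ gives a sense of how elongated the
ellipsoid $\mathrm{Ell}(\Sigma^{-1})$ is: $c(\Sigma) \ge 1$ with equality if and only if $\mathrm{Ell}(\Sigma^{-1})$ is a
sphere, i.e. a scaled copy of $S^{d-1}$.^10

^10 The quantity $c(\Sigma)^2 = \lambda_d / \lambda_1$ is the condition number of the matrix $\Sigma$. Also $\sqrt{1 - 1/c^2}$ is the
eccentricity of the two-dimensional ellipse formed by the intersection of $\mathrm{Ell}(\Sigma^{-1})$ with the
plane through the origin spanned by $u_1$ and $u_d$.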
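First we examine the scatter of the collection of the 1500 state means
in $\mathbb{R}^{39}$. We treat these means, $\mu_1, \ldots, \mu_{1500}$, as points in $\mathbb{R}^{39}$ and compute the total mean
vector (or centroid)

\[ \bar{\mu} = \frac{1}{1500} \sum_{j=1}^{1500} \mu_j, \]

and total variance or scatter matrix^11

\[ T = \frac{1}{1500} \sum_{j=1}^{1500} (\mu_j - \bar{\mu})(\mu_j - \bar{\mu})^T. \]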

Let $T_{mle}$ be computed using the mle means and $T_{mmi}$ be computed using
the means after 100 iterations of extended Baum-Welch. The difference in
the volume and elongation in the ellipsoids $\mathrm{Ell}(T_{mle}^{-1})$ and $\mathrm{Ell}(T_{mmi}^{-1})$ is a
crude measure of the effect 100 iterations of extended Baum-Welch has on
the means. The centroid does not move appreciably, but the volume that 
the means occupy increases by a factor of 2.6 x 10^. Also, the measure of 
elongation starts out at $c(T_{mle}) = 950$ but after 100 iterations of extended
Baum-Welch it decreases significantly to $c(T_{mmi}) = 57$. We would also like
to inspect visually what is happening to the means. Let's start with the mle.
We can project the 39-dimensional cloud of mle means onto the 2-dimensional
plane spanned by the eigenvectors $u_1(MLE)$ and $u_{39}(MLE)$ for $T_{mle}$. We
can do the same thing with the means after 100 iterations of extended
Baum-Welch, but this time projecting onto the different plane spanned by $u_1(MMI)$
and $u_{39}(MMI)$ for $T_{mmi}$. Figure 12 compares these two projections. Even
though these projections are onto different 2-dimensional planes in $\mathbb{R}^{39}$, we
can still see that after 100 iterations of extended Baum-Welch:

1. The spread in the cloud of means does not change appreciably in the 
direction of maximum variation (along the major axis). 

2. The cloud of means becomes more spherical by becoming more spread 
out in the direction of the minimum variation (along the minor axis). 
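The volume and elongation measures used in this comparison can be sketched in a few lines. To stay self-contained the sketch works in 2 dimensions, where the eigenvalues of a symmetric matrix have a closed form, instead of the 39 dimensions of the actual means; the point sets are hypothetical and merely mimic a cloud that spreads out along its minor axis.

```python
import math

def scatter_matrix(points):
    """Centroid and scatter matrix (a, b, c), standing for [[a, b], [b, c]]."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    c = sum((p[1] - my) ** 2 for p in points) / n
    return (mx, my), (a, b, c)

def eigenvalues(sym):
    """Eigenvalues (smallest, largest) of the symmetric matrix [[a, b], [b, c]]."""
    a, b, c = sym
    mid = (a + c) / 2.0
    rad = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    return mid - rad, mid + rad

def elongation(sym):
    """c(T) = sqrt(lambda_max / lambda_min)."""
    lo, hi = eigenvalues(sym)
    return math.sqrt(hi / lo)

def volume_factor(sym):
    """sqrt(det T), proportional to the volume (here: area) of Ell(T^-1)."""
    a, b, c = sym
    return math.sqrt(a * c - b * b)

# Hypothetical 2-D "means": the second cloud is stretched along the minor axis.
before = [(-4.0, -0.1), (-2.0, 0.1), (0.0, -0.1), (2.0, 0.1), (4.0, 0.0)]
after = [(x, 10.0 * y) for x, y in before]

_, t_before = scatter_matrix(before)
_, t_after = scatter_matrix(after)
```

Stretching only the minor axis increases the volume factor while reducing the elongation, which is the qualitative pattern reported for the means before and after extended Baum-Welch.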



Note that the cloud of mle means in Figure 12 gives a sense of where
the cloud of training data lives. It appears from Figure 12 that 100 iterations
of extended Baum-Welch have, in the direction of the minimum variation,
pushed a significant fraction of the means well beyond the location of the
training data.

^11 Note that the scatter matrix for the means, $T$, is similar to the between class
variation matrix.

Figure 12: Mean scatter along the top and bottom eigenvectors before and
after 100 iterations of extended Baum-Welch.

We conclude that one effect of MMI is an expansion of the space that 
the model means occupy. We also speculate that the direction of minimum 
variation in the model mean scatter is the least useful for discrimination, i.e. 
recognition, since it is also the direction of minimum variation in the between
class scatter. If this speculation is correct, then MMI is pushing apart the 
means the most in the direction that matters the least to recognition. 

Next we examine what happens to the model variances after 100 iterations 
of extended Baum- Welch. Recall we are using diagonal covariance matrices 
in our normal output distributions. 

To start with we shall examine histograms of the model standard
deviations before and after extended Baum-Welch. Instead of creating 39 separate
histograms, one for each of the features, we would like to create just one
histogram that combines the model standard deviations in all of the 39 feature
dimensions. To do this we rescale each feature so that its total variation
on the training data is 1, and rescale the standard deviations accordingly.
After this rescaling, the deviations in different feature dimensions are roughly
commensurable. Figure 13 displays densities fit to the resulting histograms.
It appears that the standard deviations have shrunk somewhat after 100
iterations of extended Baum-Welch, although the effect is not dramatic. It is
worth noting that none of the mle model variances were floored, while only
10 out of the approximately 59,000 variances are floored after 100 iterations
of extended Baum-Welch.

Figure 13: Densities of the scaled standard deviations before (blue) and after
(black) 100 iterations of extended Baum-Welch. The x-axis gives the standard
deviation.

For each state $j$ let $\sigma_{mle,j}$ and $\sigma_{mmi,j}$ denote the vectors of standard
deviations for state $j$ before and after 100 iterations of extended Baum-Welch,
and define

\[ V_j = \log \frac{\prod_{i=1}^{39} \sigma_{mmi,j,i}}{\prod_{i=1}^{39} \sigma_{mle,j,i}}. \]

For each $j$ the quantity $\prod_{i=1}^{39} \sigma_{mle,j,i}$ gives the volume of the 39-dimensional
rectangular parallelepiped with edges specified by the elements of the vector
$\sigma_{mle,j}$; this is also proportional to the volume of the ellipsoid $\mathrm{Ell}(\mathrm{diag}(\sigma_{mle,j}^{-2}))$.
So the quantity $V_j$ measures how this volume changes after 100 iterations
of extended Baum-Welch. Figure 14 displays a histogram of the collection
$\{V_j\}_{j=1}^{1500}$, three quarters of which are negative.
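The quantity $V_j$ is a difference of log-products, so it is safest to compute it as a difference of sums of logarithms. The sketch below does this for a single state; the 3-dimensional standard deviation vectors are hypothetical stand-ins for the 39-dimensional ones.

```python
import math

def log_volume_ratio(sigma_mle, sigma_mmi):
    """V_j = log(prod(sigma_mmi) / prod(sigma_mle)), computed via sums of logs."""
    return sum(math.log(s) for s in sigma_mmi) - sum(math.log(s) for s in sigma_mle)

# Hypothetical state whose standard deviations shrink by 10% in two dimensions.
v_shrunk = log_volume_ratio([1.0, 2.0, 0.5], [0.9, 1.8, 0.5])

# A state whose standard deviations do not change at all.
v_same = log_volume_ratio([1.0, 2.0, 0.5], [1.0, 2.0, 0.5])
```

A negative value means the volume associated with the state's variances has shrunk, which is the case for three quarters of the states in Figure 14.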

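Recall that, modulo constants, the log-likelihood of a frame
$y \in \mathbb{R}^{39}$ with respect to $N(\mu, \sigma^2)$ is

$$S = -\frac{1}{2} \sum_{i=1}^{39} \left( \frac{y_i - \mu_i}{\sigma_i} \right)^2 - \sum_{i=1}^{39} \log \sigma_i$$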

and that the numerator and denominator log-likelihoods are dominated by
state specific versions of $S$. Moreover, Figure 5 shows that after 100
iterations of extended Baum-Welch

$$\mathrm{den}(x, w; \theta_{100}, \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle}) < \mathrm{den}(x, w; \theta_{0}, \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle}). \qquad (A.1)$$

The expansion in the model means that we have observed is consistent with
(A.1), since it appears that in at least some dimensions the means are
moving away from the data, which would make the first sum in terms like $S$
more negative. However, our crude analysis of the behavior of the variances
is mixed relative to (A.1). On the one hand the standard deviations appear
to be shrinking after 100 iterations of extended Baum-Welch, which would
again make the first sum in terms like $S$ more negative, but on the other
hand we saw that the volume terms also appear to be shrinking, which would
make the second sum in terms like $S$ less negative.
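The two competing effects can be checked numerically. The sketch below splits the diagonal Gaussian log-likelihood (modulo constants) into its two sums, the scaled squared distance term $-\frac{1}{2}\sum_i ((y_i - \mu_i)/\sigma_i)^2$ and the volume term $-\sum_i \log \sigma_i$, and evaluates both before and after shrinking the standard deviations. The function name and the toy numbers are illustrative assumptions.

```python
import numpy as np

def s_terms(y, mu, sigma):
    """Return the two sums of the log-likelihood, modulo constants."""
    first = -0.5 * np.sum(((y - mu) / sigma) ** 2)   # distance term
    second = -np.sum(np.log(sigma))                  # volume term
    return float(first), float(second)

y = np.array([1.0, -1.0])       # a frame that sits away from the mean
mu = np.zeros(2)
wide = np.array([2.0, 2.0])     # standard deviations before shrinking
narrow = np.array([1.0, 1.0])   # standard deviations after shrinking

first_wide, second_wide = s_terms(y, mu, wide)
first_narrow, second_narrow = s_terms(y, mu, narrow)
```

Shrinking the standard deviations lowers the first sum and raises the second, so the net change in the log-likelihood depends on which term dominates. This is why the variance analysis is described as mixed.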
Figure 14: Histogram of $\{V_j\}$.
B Experiments with E



In this Section we explore what effect the extended Baum-Welch parameter
E has on the results in Section 3.1. The parameter E controls two
properties of extended Baum-Welch. On the one hand, if E is large enough,
then each iteration of extended Baum-Welch will result in an increase in
the MMI criterion ([6], [1]). On the other hand, E may be thought of as a
smoothing parameter that controls how closely the parameters
$\theta_{k+1}$ are tied to the previous parameters $\theta_k$. A small
value for E will produce a larger difference
$\|\theta_{k+1} - \theta_k\|$ than a large value for E will produce. This
has led to the belief that choosing E large enough not only ensures the
convergence of the sequence of approximate MMI criteria
$\{p_{\theta_k}(w \mid x; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})\}$
but also ensures the convergence of the underlying sequence of parameters
$\{\theta_k\}$.
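To make the smoothing role of E concrete, here is a minimal sketch of the commonly used extended Baum-Welch mean update for a single scalar Gaussian. It assumes the smoothing constant is set as D = E times the denominator occupancy, which is a simplification of what toolkits do in practice. The statistics are invented for illustration and are not taken from the experiments.

```python
def ebw_mean_update(mu, gamma_num, x_num, gamma_den, x_den, E):
    """One extended Baum-Welch mean update for a scalar Gaussian.

    gamma_num, gamma_den: numerator and denominator occupancies
    x_num, x_den:         occupancy-weighted sums of the observations
    """
    # Assumption: the smoothing constant D is E times the denominator
    # occupancy; real systems usually add further safeguards.
    D = E * gamma_den
    return (x_num - x_den + D * mu) / (gamma_num - gamma_den + D)

mu = 0.0
gamma_num, x_num = 10.0, 12.0   # hypothetical numerator statistics
gamma_den, x_den = 8.0, 4.0     # hypothetical denominator statistics

# Size of the parameter change |mu_new - mu| for each value of E.
steps = {
    E: abs(ebw_mean_update(mu, gamma_num, x_num, gamma_den, x_den, E) - mu)
    for E in (0.5, 1.0, 1.5, 2.0)
}
```

In this toy case the step size decreases monotonically as E grows, since a larger D pulls the new mean toward the old one. That matches the description of E as a smoothing parameter, but it says nothing about whether the iterates converge.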

We repeat the experiment described in Section 3.1, which used fixed
phone-marked word lattices generated by the mle and E = 1.0, but this time
with three other choices for E. Figure 15 shows how the MMI criterion
varies as extended Baum-Welch proceeds when E is set to 0.5, 1.0, 1.5, and
2.0. Note that E = 0.5 is too small to ensure that the approximate MMI
criterion increases at each iteration of extended Baum-Welch. Note also
that as the value of E increases the corresponding trajectories in the
approximate MMI criterion become smoother. Finally, note that the
trajectory in the approximate MMI criterion when E = 2.0 looks remarkably
similar to the corresponding trajectory when E = 1.0, albeit smoother and
shifted to the right.

Figure 16 and Table 4 show how the WER on the independent test set varies
as extended Baum-Welch proceeds when E is set to 0.5, 1.0, 1.5, and 2.0.
Larger values of E appear to be just delaying the inevitable rise in the
sequence of WERs. When E is set to 0.5, 1.0, or 1.5, the WER at iteration
100 is larger than that of the mle. We also ran 100 more iterations of
extended Baum-Welch when E = 2.0. In this case the WER on the independent
test set is 42.8% at iteration 200, which is nearly the same as the WER
when E = 1.0 at iteration 100, namely 42.2%. Roughly speaking, as the
reader can verify from Figure 16 or Table 4, the WER at iteration $k$ with
$E = e_1$ will be the same as the WER at iteration $\frac{e_2}{e_1} k$
with $E = e_2$.

While choosing E > 1 does ensure that each iteration of extended
Baum-Welch results in an increase in the approximate MMI criterion, values
of E larger than 1 only appear to delay the pathological behavior that we
observed in Section 3.1.
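The rough scaling rule (the WER at iteration $k$ with $E = e_1$ matches the WER at iteration $(e_2/e_1) k$ with $E = e_2$) can be checked against a few entries of Table 4. The sketch below compares E = 1.0 at iterations 20 to 50 with E = 2.0 at iterations 40 to 100. The WER values are copied from the table, while the helper name `matched_iteration` is hypothetical.

```python
# WER (%) on the test set, taken from Table 4 for two values of E.
wer = {
    1.0: {20: 12.0, 30: 12.6, 40: 13.8, 50: 15.0},
    2.0: {40: 11.8, 60: 12.1, 80: 13.6, 100: 15.0},
}

def matched_iteration(k, e1, e2):
    """Iteration at E = e2 that the rule pairs with iteration k at E = e1."""
    return round(k * e2 / e1)

# Pair each E = 1.0 iteration with its E = 2.0 counterpart.
pairs = [(k, matched_iteration(k, 1.0, 2.0)) for k in sorted(wer[1.0])]
# Absolute WER difference between the paired iterations.
gaps = [abs(wer[1.0][k] - wer[2.0][k2]) for k, k2 in pairs]
```

On these entries the paired WERs differ by at most about half a percentage point, which is consistent with reading larger E as slowing the same trajectory rather than changing it.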
Figure 15: The MMI criterion on the training data during 100 iterations
of extended Baum-Welch using different values for E. The x-axis gives the
extended Baum-Welch iteration, with x=0 being the mle.

Figure 16: The WER measured on the test set during 100 iterations of
extended Baum-Welch using different values for E. The x-axis gives the
extended Baum-Welch iteration, with x=0 being the mle.
iteration   E = 0.5   E = 1.0   E = 1.5   E = 2.0
mle           17.7      17.7      17.7      17.7
10             --        --        --        --
20            14.2      12.0      11.6      11.7
30            15.2      12.6      11.9      11.7
40            19.6      13.8      12.0      11.8
50            29.0      15.0      12.6      12.0
60            38.6      17.2      13.7      12.1
70            45.3      20.4      14.4      12.8
80            49.7      26.7      16.1      13.6
90            54.6      34.4      17.6      14.3
100           52.2      42.2      19.9      15.0

Table 4: WER (%) on the test set during 100 iterations of extended
Baum-Welch using different values for E.



C MPE 

Minimum phone error (MPE) training ([11]) uses yet another model selection
criterion coupled with an iterative estimation algorithm that is
essentially the same as extended Baum-Welch. Given
$\tilde{w} \in \mathcal{W}$ we define $A(\tilde{w}, w)$ to be the phone
accuracy of the transcription $\tilde{w}$ measured relative to the true
transcription $w$ and a dictionary that provides phone level
pronunciations for each transcription in $\mathcal{W}$. Then the MPE
criterion, $\mathcal{F}_{mpe}$, is defined at each $\theta$ to be

$$\mathcal{F}_{mpe}(\theta) = \sum_{\tilde{w} \in \mathcal{W}} p_\theta(\tilde{w} \mid x; \kappa) \, A(\tilde{w}, w).$$
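The MPE criterion is an expected phone accuracy under the model's posterior over transcriptions. The sketch below evaluates it for a toy list of three competing transcriptions. It assumes that the scale $\kappa$ multiplies the joint log-probabilities before normalization, which is one common convention and may differ from the definition used earlier in the paper. All names and numbers are hypothetical.

```python
import numpy as np

def mpe_criterion(log_joint, accuracy, kappa):
    """Expected phone accuracy under a kappa-scaled posterior.

    log_joint[i]: log p(x, w_i) for the i-th candidate transcription
    accuracy[i]:  phone accuracy A(w_i, w) against the true transcription
    """
    # Assumption: kappa scales the joint log-probabilities.
    scaled = kappa * np.asarray(log_joint, dtype=float)
    scaled -= scaled.max()            # stabilize the exponentials
    posterior = np.exp(scaled)
    posterior /= posterior.sum()      # normalize over the candidates
    return float(np.dot(posterior, accuracy))

log_joint = [-100.0, -102.0, -110.0]  # toy scores for three candidates
accuracy = [5.0, 4.0, 1.0]            # toy phone accuracies

value = mpe_criterion(log_joint, accuracy, kappa=0.1)
sharp = mpe_criterion(log_joint, accuracy, kappa=100.0)
```

With a small scale the posterior is spread across the candidates and the criterion is a weighted average of their accuracies. With a large scale nearly all mass falls on the top-scoring candidate and the criterion approaches that candidate's accuracy.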

In this Section we report preliminary results on an experiment analogous
to the experiment in Section 3.1: we follow the standard procedure in the
literature by generating phone-marked word lattices once and for all using
the mle, but this time we use an approximation to the MPE criterion rather
than an approximation to the MMI criterion for model selection during 100
iterations of extended Baum-Welch. Since we will use fixed phone-marked
word lattices for each pass of extended Baum-Welch, this means that the
corresponding approximate MPE criterion that we use for model selection is
given by:



Figure 17 displays the WER on the independent test set and the approximate
MPE criterion as extended Baum-Welch proceeds. (We use HTK to perform this
experiment with standard settings for MPE, namely E = 2.0 and
$\tau = 50$.) Since Figure 17 only displays the WER every tenth iteration,
we note that the WER reaches a minimum of 11.7% on iteration 6, which the
reader will recall is similar to the performance that we observed in
Section 3.1 using MMI.

Figure 17: The WER on an independent test set and the MPE criterion on
the training data during 100 iterations of extended Baum-Welch. The x-axis
gives the extended Baum-Welch iteration, with x=0 being initial models.
One might expect MPE to be less prone to the problems that we observed
with MMI in Section 3.1, since the MPE criterion is in a sense
self-regulating: if we push the model parameters $\theta$ too far away
from the mle, then presumably this will induce phone level errors, which
will result in a lower value for
$\mathcal{F}_{mpe}(\theta; \kappa, \mathcal{V}_{mle}, \mathcal{R}_{mle})$.
Unfortunately, this self-regulation is not sufficient to prevent the
pathological behavior exhibited in Figure 17. The overall trend in the
approximate MPE criterion is upward, but after iteration 50 so is the
overall trend in test set WER.